-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-13801][SQL] DataFrame.col should return unresolved attribute #11632
Conversation
@@ -115,7 +115,7 @@ class DataFrameJoinSuite extends QueryTest with SharedSQLContext { | |||
Row("1", 1) :: Row("2", 1) :: Row("3", 1) :: Nil) | |||
} | |||
|
|||
test("[SPARK-6231] join - self join auto resolve ambiguity") { | |||
ignore("[SPARK-6231] join - self join auto resolve ambiguity") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change, I'm not sure this "auto resolve ambiguity" feature is useful anymore.
Test build #52827 has finished for PR 11632 at commit
|
The major problem I see is that now any two DataFrames must be aliased first before being joined. This breaks existing user code. However, this PR does solve a more common and subtle problem that is hard to debug, especially when the join condition is not as trivial as |
I like the idea. If we do not change the current behavior, issuing an error with name ambiguity is just confusing. Maybe we just document it. It might be enough. However, more importantly, it could generate a wrong result. See the example val df1 = Seq((1, 3), (2, 1)).toDF("keyCol1", "keyCol2")
val df2 = Seq((1, 4), (2, 1)).toDF("keyCol1", "keyCol3")
val df3 = df1.join(df2, df1("keyCol1") === df2("keyCol1")).select(df1("keyCol1"), $"keyCol3")
df3.join(df1, df3("keyCol3") === df1("keyCol1")).show() The above query returns an empty result set. |
5eb1b5d
to
2a0122e
Compare
Test build #53006 has finished for PR 11632 at commit
|
Test build #53032 has finished for PR 11632 at commit
|
Test build #53056 has finished for PR 11632 at commit
|
Test build #53069 has finished for PR 11632 at commit
|
Test build #53073 has finished for PR 11632 at commit
|
Test build #53196 has finished for PR 11632 at commit
|
Test build #53261 has finished for PR 11632 at commit
|
Test build #53288 has finished for PR 11632 at commit
|
Test build #53671 has finished for PR 11632 at commit
|
Test build #53672 has finished for PR 11632 at commit
|
Test build #53891 has finished for PR 11632 at commit
|
Test build #53894 has finished for PR 11632 at commit
|
Test build #53899 has finished for PR 11632 at commit
|
Test build #53914 has finished for PR 11632 at commit
|
Great work :) It seems that user intentions in the example query is obvious to some extent, so I wonder if we can automatically fix this kind of name ambiguity in an |
While this may help with join ambiguity. I think the more fundamental problem is that a transformed DataFrame should not be giving the same column references as the original. For an example where this is still a problem even with this patch see @wesm recent blog post http://wesmckinney.com/blog/compiling-dataframe-code/ |
Test build #54134 has finished for PR 11632 at commit
|
Test build #54313 has finished for PR 11632 at commit
|
retest this please |
Test build #54315 has finished for PR 11632 at commit
|
retest this please |
Test build #54317 has finished for PR 11632 at commit
|
Test build #54316 has finished for PR 11632 at commit
|
Test build #54321 has finished for PR 11632 at commit
|
Test build #54403 has finished for PR 11632 at commit
|
Shall we close this? |
What changes were proposed in this pull request?
Let's start with an example:
This query won't work and returns wrong result as
df("key")
anddf2("key")
reference to a same column.I think the biggest problem is, we give users the resolved attribute. However, resolved attribute is not real column, as logical plan's output may change. For example, we will generate new output for the right child in self-join.
We should make
Dataset.col
return unresolved attribute, and still do the resolution to make sure the given column name is resolvable, but don't return the resolved one, just get the name out and wrap it with UnresolvedAttribute.We should also alias every single
Dataset
ever generated, with a globally unique name, so thatDataset.col
will return an unresolved attribute with a unique qualifier, and our framework can reference the corrected column with it.When we alias a
Dataset
, we should clean out previously generated aliases, to avoid making the plan tree too big.Now if users run the query again, it works.
There are 3 exceptional cases that we should not alias:
Dataset.as
: this is obvious, we can't override user-provided alias.Dataset.toDF()
: this just turns a strongly typedDataset
to genericDataFrame
, not creating a new one, no need to alias.join
:join
is very special, it may have same-name outputs with different qualifiers, so we can't alias it as we will replace the original qualifiers and make the same-name outputs un-distinguishable. We can't clean out previously generated aliases either, as it will erase the original qualifiers.However, if users manually introduce ambiguity, this PR will break existing code, for example:
It will fail analysis after this PR, because the 2
id
columns have same qualifier.This PR also breaks some corner cases that works before, but I think it's good to break it:
It will throw an exception now, because we can't resolve
df("key")
based ondf1
. I think this is reasonable,df
anddf1
are differentDataset
s, it's weird to support select a column from differentDataset
. Another case:It doesn't make sense to support self-join in this way, users should use
df.join(df, "key")
. And this trick is very fragile, it stops working if we do some operation on the column, e.g.df.join(df, (df("key") + 1) === df("key"))
How was this patch tested?
existing tests.